Introduction

With increased computing power, machine learning algorithms have become a top choice for handling large datasets, building robust models, and fine-tuning parameters, since many of their operations and processes can be designed to run in parallel. There have been many recent papers on applications of machine learning in social science. For instance, Nyman & Ormerod compared ordinary least squares regression with random forest regression in predicting economic recessions and obtained a considerably higher adjusted R-squared with random forest regression (Nyman & Ormerod, 2017). A more detailed discussion of various machine learning algorithms for predicting economic growth and recession appears in the 2017 book Machine Learning Techniques in Economics (Basuchoudhary, Bang, & Sen, 2017). In the field of environmental science, a 2017 paper surveyed the accuracy of several machine learning algorithms, including LASSO regression, random forests, and neural networks, in predicting ragweed pollen concentration (Liu et al., 2017).

In both Nyman & Ormerod's and Liu's examples, random forest proved to be the best-performing method, both in prediction accuracy and in variable selection. It was, however, not available as a Stata package until summer 2017, when the authors of this paper developed a Java-based Stata implementation of the algorithm. In this paper, we will briefly discuss the mathematical background of random forests. We will then illustrate two examples in Stata, one predicting British General Election outcomes (Fieldhouse et al., n.d.) and the other estimating log-scaled income from the 2016 U.S. Survey of Consumer Finances (“Full public data set - full data sets of all SCF variables,” n.d.).

Mathematical Background

Tree-based models have long been popular analytical tools because of their versatility and interpretability. A tree-based model generally involves recursively partitioning the given data set into two groups based on a certain criterion until a pre-determined stopping condition is met. Figure 1 below illustrates a recursive partitioning of a 2-dimensional input space with axis-aligned boundaries. At the bottom of decision trees are leaf nodes that correspond to the target values.

Figure 2 is a graphical representation of the subspaces partitioned in Figure 1.

Depending on how the partition and stopping criteria are set, decision trees can be designed both for classification and regression tasks.

For both classification and regression problems, the subset of predictor variables selected to split an internal node depends on a pre-determined splitting criterion, which is formulated as an optimization problem. A common splitting criterion in classification problems is entropy, a practical application of Shannon’s source coding theorem, which specifies the lower bound on the length of a random variable’s bit representation (Shannon, 2001). At each internal node of the decision tree, entropy is given by the formula \[E = -\sum_{i = 1}^{c}p_i \log(p_i)\] where \(c\) is the number of unique classes and \(p_i\) is the prior probability of each class. At every split of the decision tree, the reduction in entropy, i.e. the information gain, is maximized. Entropy is the splitting criterion used in the WEKA Random Forest implementation as well as in the Stata plugin used in this paper. For regression problems, a commonly used splitting criterion is the mean squared error or root mean squared error at each internal node; the root mean squared error is minimized at every split in order to minimize the final MSE.
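The entropy of a node and the information gain produced by a candidate split can be sketched in Python (a minimal illustration of the formulas above; the function names are ours, not part of the Stata plugin):

```python
from math import log2

def entropy(labels):
    # E = -sum(p_i * log2(p_i)) over the unique classes in this node.
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(parent, left, right):
    # Reduction in entropy achieved by splitting `parent` into `left` and `right`.
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted
```

A split that separates the classes perfectly recovers all of the parent's entropy as information gain.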

One obvious problem with decision trees is that they are prone to over-fitting, which happens when the model becomes so closely tailored to the training data set that it performs poorly on data it has never seen before, i.e. the test data set. Moreover, since a decision tree can be grown arbitrarily large, it is hard to approximate its prediction error.

One way to increase generalization accuracy is to build many individual trees, each trained on a random subset of the data and of the predictor variables. First introduced by Ho, this idea of the random subspace method was later extended and formally presented as Random Forests by Breiman in 2001 (Breiman, 2001; Ho, 1998). Random Forests is an ensemble tree-based learning algorithm. It builds on the fundamental concept of decision trees and utilizes bootstrap aggregation, also known as bagging, to prevent over-fitting and improve generalization accuracy. The algorithm is as follows:

for b in 1:B:
    draw a bootstrap sample of size N from the training data
    grow a tree on the bootstrap sample; at each node, until the minimum node size is reached:
        randomly select a subset of m predictor variables from the total p
        split the internal node into two child nodes using whichever of the m
            predictors optimizes the splitting criterion
output the ensemble composed of all B sub-trees grown in the loop
regression prediction:
    f_hat(x) = sum of all sub-tree predictions divided by B
classification prediction:
    f_hat(x) = majority vote of all predicted classes over B trees
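The bootstrap-sampling and aggregation steps of the algorithm can be sketched in Python (a toy illustration with hypothetical helper names; the tree-growing step itself is omitted):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    # Draw a sample of size N with replacement from the training data.
    return [rng.choice(data) for _ in data]

def regression_prediction(tree_predictions):
    # f_hat(x): average of the B sub-tree predictions.
    return sum(tree_predictions) / len(tree_predictions)

def classification_prediction(tree_predictions):
    # f_hat(x): majority vote of the predicted classes over B trees.
    return Counter(tree_predictions).most_common(1)[0][0]
```

In a full implementation each bootstrap sample would be passed to the tree-growing routine, and the two prediction functions would aggregate the B fitted sub-trees' outputs.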

One of the many advantages of the random forest algorithm is that it gives a more accurate estimate of the error rate compared with a single decision tree. More specifically, Breiman’s paper mathematically proves that the error rate always converges. In a classification problem, the error rate is approximated by the out-of-bag error during the training process. For each sub-tree of the random forest, the out-of-bag error is calculated by taking the observations that were omitted from that sub-tree’s bootstrap training sample and running them through the sub-tree to obtain the prediction error. In a regression problem, the out-of-bag error is the root mean squared error over each sub-tree’s out-of-bag observations. Finding hyper-parameters that produce a low out-of-bag error is often a key consideration in model selection and parameter tuning. Note that in the random forest algorithm, the size of the subset of predictor variables, \(m\), is crucial to controlling the final depth of the trees. Hence it is a hyper-parameter that needs to be tuned during model selection, which will be discussed in the examples later.
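The out-of-bag bookkeeping for a single sub-tree can be sketched in Python (hypothetical helper names; the classification case is shown):

```python
def out_of_bag_indices(n_obs, bootstrap_idx):
    # Observations never drawn into the bootstrap sample are out-of-bag for this sub-tree.
    in_bag = set(bootstrap_idx)
    return [i for i in range(n_obs) if i not in in_bag]

def misclassification_rate(truths, predictions):
    # Fraction of out-of-bag observations the sub-tree predicts incorrectly.
    errors = sum(t != p for t, p in zip(truths, predictions))
    return errors / len(truths)
```

Averaging this rate over all sub-trees yields the out-of-bag error reported during training.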

Another factor of interest in model selection is variable importance, which is calculated, separately for each predictor variable, by adding up the improvement in the splitting criterion’s objective function over all internal nodes of a sub-tree, then summing over all sub-trees. In the Stata implementation of random forest, the variable importance scores are normalized by dividing each score by the maximum score.
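The normalization step can be sketched in Python (the raw scores here are hypothetical):

```python
def normalize_importance(raw_scores):
    # Divide every raw importance score by the maximum, so the top variable scores 1.
    top = max(raw_scores.values())
    return {var: score / top for var, score in raw_scores.items()}
```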

Stata Syntax

The Stata syntax to fit a random forest model is:

randomforest depvar indepvars [if] [in] , [ options ]

with the following post-estimation command:

predict newvar | varlist | stub* [if] [in] , [ pr ]

Example: British Election Data

The following example uses a data set from the British Election Study (Fieldhouse et al., n.d.). The data set contains key statistics from the 2017 General Election results, organized by constituency. We will use features such as electorate size and voter turnout to predict the election outcome for each constituency, a categorical variable with seven levels – i.e. the winning party of each constituency is one of seven parties in total. Below is a formal setup of the problem:

Let \(x_i \in \mathbb{R}^d\) be a constituency from the election survey data set, with each \(x_{i,j}\) representing an observed value of the \(j\)-th feature, and corresponding response variable \(y_i \in \{1, 2, \cdots, 7\}\). Given \(N\) observations, construct a model that best predicts the response variable.

To approach this problem, one could use parametric methods such as multinomial logistic regression, or non-parametric methods such as Random Forests. Both options will be explored to compare their performance and accuracy.

First load the data set and randomize. This is necessary if the data has been pre-ordered, in which case a subset selected as the training set may only include a single class and would produce an error message in Stata.

clear
use "BES-2017-General-Election-results-file-v1.0.dta"
set seed 201711
gen u=uniform()
sort u

Next, tune the parameters to find the optimal model. Since the number of iterations (i.e. the number of sub-trees) and the number of variables randomly investigated at each split are the dominating factors for the overall shape of the random forest, these two will be the parameters to optimize. The following code segment iteratively calculates the prediction accuracy for a varying number of iterations and a fixed value of \(numvars\). The number of iterations starts at \(10\) and is incremented by \(5\) each time until it reaches \(2000\).

gen out_of_bag_error1 = .
gen iter1 = .
local j = 0
forvalues i = 10(5)2000{
    local j = `j' + 1
    randomforest Winner17 *sex17 *Vote17 Country Turnout17 Electorate17 Majority17 ///
        Region ConstituencyType, type(class) iter(`i') numvars(1)   
    replace iter1 = `i' in `j'
    replace out_of_bag_error1 = `e(OOB_Error)' in `j'
}

Note that the wildcard character, “*”, used in the code selects all variables whose names end with “sex17” and “Vote17”. In other words, we are querying for the genders of the candidates participating in the 2017 election for each party at each constituency, as well as the number of votes captured by each party at each constituency. Some parties might not have candidates running in every constituency, which causes some values in those variables to be missing. This is not an issue for our random forest model, since the sub-trees simply partition the data set.

Repeat the above process four more times by generating new \(out\_of\_bag\_error\) and \(iter\) variables with names corresponding to \(numvars = \{5, 10, 15, 22\}\).

Typically, \(numvars\) is tuned using a line search, or a grid search when multiple hyper-parameters are tuned at the same time. For better illustration, we have decided to tune this parameter using a few benchmark values instead of a thorough line search. The same approach is used for the regression example in the next section. The differences in algorithm performance become clear once the out-of-bag errors are plotted.
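In schematic Python, the benchmark-value search amounts to evaluating each candidate pair and keeping the one with the lowest out-of-bag error (the scoring function below is a stand-in; in practice each call would fit a random forest and return e(OOB_Error)):

```python
def tune(benchmark_numvars, iteration_grid, fit_and_score):
    # Evaluate every (numvars, iterations) pair; keep the pair with the lowest error.
    results = {}
    for m in benchmark_numvars:
        for b in iteration_grid:
            results[(m, b)] = fit_and_score(m, b)
    best = min(results, key=results.get)
    return best, results
```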

A more rigorous model building process involves both variable selection and model parameter tuning. Since this data set is compiled from surveys and census data, we can rely on prior knowledge to narrow down the range of predictor variables applicable to our problem setup. Specifically, since this data set is a compilation of various election-related metrics and results gathered between 2011 and 2017, only 2017 voting and demographics-related variables have been selected since these are the most recent and relevant predictors. For full documentation on the dataset and the list of predictor variables used, please consult Appendix A.

Next, plot the out-of-bag errors against the number of iterations for all values of numvars using the code segment below:

scatter out_of_bag_error1 iter1, mcolor(blue) msize(tiny) || ///
scatter out_of_bag_error5 iter5, mcolor(dkgreen) msize(tiny) || ///
scatter out_of_bag_error10 iter10, mcolor(orange) msize(tiny) || ///
scatter out_of_bag_error15 iter15, mcolor(red) msize(tiny) || ///
scatter out_of_bag_error22 iter22, mcolor(black) msize(tiny)

We obtain Figure 3:

From the figure, we can see that for a large enough number of iterations, \(iter \geq 1500\), the difference in prediction error estimated by the out-of-bag error is marginal for \(num.vars = \{10, 15, 22\}\). Hence in practice, for faster computation, we would choose the lowest value, \(num.vars = 10\). As for the number of iterations, since the out-of-bag error behaviour is consistent for iterations of \(1500\) and greater, we would choose \(iter = 1500\) for computational efficiency.

Now that we have decided on the hyper-parameters, we move on to calculate the variable importance score of each predictor. As a reminder, the variable importance score was previously defined in the mathematical background section as a way to see how influential the variables are in the final model.

The following code segment plots the variable importance:

clear
use "BES-2017-General-Election-results-file-v1.0.dta"
randomforest Winner17 *sex17 *Vote17 Country Turnout17 Electorate17 Majority17 ///
Region ConstituencyType, type(class) iter(1500) numvars(10) 

matrix importance = e(importance)
svmat importance
gen id=""

local mynames : rownames importance
local k : word count `mynames'
if `k'>_N {
    set obs `k'
    }
forvalues i = 1(1)`k' {
    local aword : word `i' of `mynames'
    local alabel : variable label `aword'
    if ("`alabel'"!="") qui replace id= "`alabel'" in `i'
    else qui replace id= "`aword'" in `i'
    }

graph hbar (mean) importance, over(id, sort(1) label(labsize(2))) ytitle(Importance)

Now if we tried to solve this classification problem with multinomial logistic regression using the following code:

mlogit Winner17 *sex17 *Vote17 Country Turnout17 Electorate17 ///
    Majority17 Region ConstituencyType

We will see from the output that there is a computation problem caused by missing values in the \(\texttt{*sex17}\) and \(\texttt{*Vote17}\) variables. This is not an issue for random forest since it is a tree-based model. To perform multinomial logistic regression using the same set of predictors, the missing values need to be imputed. The Stata commands \(\texttt{mvpatterns *sex17}\) and \(\texttt{mvpatterns *Vote17}\) show the patterns of missing values for the two sets of predictors. We can then use algorithms such as Expectation Maximization or Multivariate Imputation by Chained Equations to impute the missing values.
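EM and MICE are beyond the scope of this paper; as a minimal illustration of the idea of imputation, its simplest variant, mean imputation of a single numeric column, can be sketched in Python (`None` stands in for Stata's missing value):

```python
def mean_impute(column):
    # Replace each missing entry with the mean of the observed entries.
    observed = [v for v in column if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in column]
```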

In conclusion, building a random forest model on this data set allows us to achieve an estimated 96% accuracy in predicting the election outcome of each constituency, with the hyper-parameter values set to \(num.vars = 10\) and \(iter = 1500\), using the specified 22 predictors including candidate gender, party-specific vote counts, turnout rate, etc.

Example: Survey of Consumer Finances

The following example uses the 2016 Survey of Consumer Finances, which is publicly available on the Federal Reserve’s official website (“Full public data set - full data sets of all SCF variables,” n.d.). In this example, we predict log-scaled household income from a mixture of \(20\) numerical and categorical predictors detailed in Appendix B. We will compare the random forest model against linear regression.

Reset the maximum number of variables in Stata so that the entire data set can be loaded properly, randomize as in the previous example, then generate a new variable for log-scaled income:

clear
clear matrix
set maxvar 30000
use p16i6
set seed 201711
gen u = uniform()
sort u
gen logIncome = ln(X5702)
replace logIncome = 0 if logIncome == .

Since this data set is large enough, we will use a 50-50 split to partition it into training and testing sets. We then perform parameter tuning using the same benchmark-value approach as in the previous example:

gen iter5 = .
gen prediction_rmse5 = .
gen oob_err5 = .
local j = 0
forvalues i = 10(5)500{
    local j = `j' + 1
    randomforest logIncome X5701 X7402 X6792 X6791 X3501 X7063 X7805 X7801 X3101 ///
     X3103 X1301 X1336 X1101 X723 X501 X7973 X7126 X7122 X7593 X7592 in 1/15620, ///
     type(reg) iter(`i') numvars(5)
    predict prf in 15621/31240
    replace iter5 = `i' in `j'
    replace prediction_rmse5 = `e(RMSE)' in `j'
    replace oob_err5 = `e(OOB_Error)' in `j'
    drop prf
}

Do this for \(numvars = \{10, 15, 20\}\), then plot the prediction RMSE as well as the out-of-bag errors against number of iterations:

scatter oob_err5 iter5, mcolor(dkgreen) msize(tiny) || ///
scatter oob_err10 iter10, mcolor(blue) msize(tiny) || ///
scatter oob_err15 iter15, mcolor(red) msize(tiny) || ///
scatter oob_err20 iter20, mcolor(black) msize(tiny)
scatter prediction_rmse5 iter5, mcolor(dkgreen) msize(tiny) || ///
scatter prediction_rmse10 iter10, mcolor(blue) msize(tiny) || ///
scatter prediction_rmse15 iter15, mcolor(red) msize(tiny) || ///
scatter prediction_rmse20 iter20, mcolor(black) msize(tiny)

We get the following two plots:

We can see from the graphs that while the out-of-bag error underestimates the actual prediction RMSE, it accurately captures the contrasting effects on the error rates as the value of \(\texttt{num.vars}\) varies. For large enough iterations, \(\texttt{iter >= 300}\), \(\texttt{num.vars = 5}\) gives the lowest prediction RMSE, around \(0.88\).
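The prediction RMSE reported here is computed over the held-out half of the data; for reference, the computation can be sketched in Python:

```python
from math import sqrt

def rmse(actual, predicted):
    # Root mean squared error between held-out targets and model predictions.
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
```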

To compare random forest against parametric methods, we will use linear regression on the same training and testing subsets by running the following commands:

regress logIncome X5701 X7402 X6792 X6791 X3501 X7063 X7805 X7801 X3101 ///
        X3103 X1301 X1336 X1101 X723 X501 X7973 X7126 X7122 X7593 X7592 in 1/15620
predict preg in 15621/31240
eret li

The root mean squared error is \(1.10990\), about 25% higher than the lowest RMSE we’ve obtained and still higher than the highest RMSE we’ve obtained in the random forest models.

References

Basuchoudhary, A., Bang, J. T., & Sen, T. (2017). Machine-learning techniques in economics: New tools for predicting economic growth. Springer International Publishing.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

Fieldhouse, E., Green, J., Evans, G., Schmitt, H., Eijk, C. van der, Mellon, J., & Prosser, C. (n.d.). British election study 2017 constituency results file, version 1.0. Retrieved from http://www.britishelectionstudy.com/data-object/2017-bes-constituency-results-with-census-and-candidate-data/

Full public data set - full data sets of all SCF variables. (n.d.). Survey of Consumer Finances. Board of Governors of the Federal Reserve System (U.S.). Retrieved from https://www.federalreserve.gov/econres/files/scf2016s.zip

Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832–844.

Liu, X., Wu, D., Zewdie, G. K., Wijerante, L., Timms, C. I., Riley, A., … Lary, D. J. (2017). Using machine learning to estimate atmospheric ambrosia pollen concentrations in Tulsa, OK. Environmental Health Insights, 11. https://doi.org/10.1177/1178630217699399

Nyman, R., & Ormerod, P. (2017). Predicting economic recessions using machine learning algorithms.

Shannon, C. E. (2001). A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1), 3–55.